feat(binary): add git binary patch support#61
feat(binary): add git binary patch support#61weihanglo wants to merge 12 commits intobmwill:masterfrom
Conversation
61c9c2b to
531e8d9
Compare
* Parse `diff --git` extended headers * split multi-file git diffs at `diff --git` boundaries
Compat test for also `git apply`.
Unlike unidiff, gitdiff produces patches for empty file creations/deletions (`0\t0` in numstat) because they carry `diff --git` + extended headers even without hunks. Binary files (`-\t-\t`) are skipped in gitdiff mode for now.
bmwill#62 optimized it
7101ee1 to
a2190d2
Compare
* Added types representing both literal and delta Git binary patches * Added a parser for the `GIT binary patch` format. This doesn't include the patch application (which will be added in later commits) The implementation is based on * Specification from <https://diffx.org/spec/binary-diffs.html> * Behavior observation of Git CLI
Returns the longest valid UTF-8 prefix of the input. For `str` this is the entire input; for `[u8]` it truncates at the first invalid byte. This will be used at call sites where generic `T: Text` input needs to be narrowed to `&str` for parsers that only handle ASCII data (e.g. binary patch base85 content).
The API was stabilized in 1.73. The lint was added in 1.93. This is required for a MSRV bump to 1.75
This is a preparation for binary diff application support. * Git binary patch is compressed by zlib hence flate2 * zlib-rs (which is the most performant zlib backend) requires MSRV 1.75.0+ hence the bump.
* Add base85 encoder/decoder and Git delta format decoder. * Wire them into `BinaryPatch::apply() and `apply_reverse()` for decoding zlib-compressed, base85-encoded binary payload. These are feature-gated behind the `binary` feature.
Now both tests require `binary` Cargo feature.
There was a problem hiding this comment.
We use our hand-written one. Alternatives:
- base85
- https://crates.io/crates/base85
- Same RFC 1924 alphabet
- MPL-2.0
- Doesn't have
decode_intoto avoid allocation
- base85rs
- https://crates.io/crates/base85rs
- Same RFC 1924 alphabet
decode -> Optionnot good for error reporting- Doesn't have
decode_intoto avoid allocation
- ascii85
- https://crates.io/crates/ascii85
- It's not Git base85
- z85
- https://crates.io/crates/z85
- ZeroMQ variant
There was a problem hiding this comment.
The hand-written one here is solid, and it is nice to avoid an extra dep if we can
| } else { | ||
| // ADD instruction | ||
| let add_len = control as usize; | ||
| let data = cursor.read_bytes(add_len)?; | ||
| result.extend_from_slice(data); | ||
| } |
There was a problem hiding this comment.
Its been a while since i've implemented the delta compression algorithm but I think that this may be mis handling the case where control == 0x00 which is reserved by by git but here we're treating it as an ADD of 0 bytes.
Although rereading the diffx spec you linked at the top of this file 0x00 doesn't seem to be called out as special or reserved so we could probably go either way.
There was a problem hiding this comment.
Good catch! This is definitely worth a compat test. I dare not reading GPL source to confirm that 😨.
| }); | ||
| } | ||
|
|
||
| let mut result = Vec::with_capacity(header_mod_size as usize); |
There was a problem hiding this comment.
This can be possibly dangerous as we're reading a u64 value from a possibly malformed set of delta instructions which could lead to some inputs causing an instant OOM. We probably want to try and cap the amount we're willing to allocate up front in order to avoid small adversarial inputs from crashing the process.
| fn decode_data(binary_data: &BinaryData<'_>) -> Result<Vec<u8>, BinaryPatchParseError> { | ||
| use std::io::Read; | ||
|
|
||
| let compressed = decode_base85_lines(binary_data.data)?; | ||
|
|
||
| let mut decoder = flate2::read::ZlibDecoder::new(&compressed[..]); | ||
| let mut decompressed = Vec::new(); | ||
| decoder | ||
| .read_to_end(&mut decompressed) | ||
| .map_err(|e| BinaryPatchParseErrorKind::DecompressionFailed(e.to_string()))?; | ||
|
|
||
| Ok(decompressed) | ||
| } |
There was a problem hiding this comment.
We don't seem to do any checks that the decompressed data matches the size that is present in BinaryData.size. Do we want to do an extra check for that to ensure the decompressed data is the expected size?
Awesome! hopefully getting all this in wasn't too painful. I have a few smaller things that I think would be nice to add or polish myself so hopefully i can find the time to do that before we are ready to do a release, although I won't hold up a release if i don't get around to those things. |
Feel free to dump your thoughts in an issue. I can help clean them up! Here is a rough list in my mind:
|
Blocked on #59. Please review from 58f95d7
Basically a port of weihanglo#33.
BTW, this is probably the very last major feature PR. I might have some other small ones for housekeeping and preparation of the new release.
Add parsing and application of git binary patches,
including both literal and delta formats.
This feature is behind the
binaryCargo feature.The implementation is based on
Some others highlights:
binaryfeature.